Prediction of Morbidity and Mortality After Esophagectomy: A Systematic Review

Background. Esophagectomy for esophageal cancer has a complication rate of up to 60%. Prediction models could be helpful to preoperatively estimate which patients are at increased risk of morbidity and mortality. The objective of this study was to determine the best prediction models for morbidity and mortality after esophagectomy and to identify commonalities among the models.
Patients and Methods. A systematic review was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement and was prospectively registered in PROSPERO (https://www.crd.york.ac.uk/prospero/, study ID CRD42022350846). PubMed, Embase, and Clarivate Analytics/Web of Science Core Collection were searched for studies published between 2010 and August 2022. The Prediction model Risk of Bias Assessment Tool was used to assess the risk of bias. Extracted data were tabulated and a narrative synthesis was performed.
Results. Of the 15,011 articles identified, 22 studies were included using data from tens of thousands of patients. This systematic review included 33 different models, of which 18 models were newly developed. Many studies showed a high risk of bias. The prognostic accuracy of models ranged from 0.51 to 0.85. For most models, variables are readily available. Two models for mortality and one model for pulmonary complications have the potential to be developed further.
Conclusions. The availability of rigorous prediction models is limited. Several models are promising but need to be further developed. Some models provide information about risk factors for the development of complications. Performance status is a potentially modifiable risk factor. None of the models are ready for clinical implementation.
Supplementary Information. The online version contains supplementary material available at 10.1245/s10434-024-14997-4.

Esophageal cancer is the sixth leading cause of cancer death worldwide. 13][4][5] These postoperative complications are associated with significant morbidity, mortality, and health economic effects.[8][9][10][11] Early identification of patients at high risk of severe complications has three potentially important healthcare benefits. First, patients at high risk of complications or death can be better informed about potential adverse consequences of surgery, which may lead to alternative treatment strategies. Second, potential preventative measures can be tailored to the risk profile by influencing potentially modifiable risk factors. Third, patients at the highest risk levels can be monitored more closely (for example, using remote monitoring or high-care ward admission) for early detection and treatment of complications; such interventions might not be cost-effective for the whole population.
A large number of preoperative prediction models have been developed in recent years on morbidity and mortality after esophagectomy. These models could potentially be helpful in identifying high-risk patients, but their usefulness has not yet been assessed systematically. ][14][15][16] The primary aim of this study was to evaluate which of the existing prediction models are most suitable for potential widespread implementation. To evaluate the potential usefulness and readiness for clinical practice, we integrated results of models' predictive performance with methodological quality assessment and availability of the input variables.
The secondary aim was identification of commonalities among the best-performing models, in which the focus was on models predicting mortality and pulmonary complications.

Study Design
This is a systematic review. The conduct and reporting of this review adhere to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (www.prisma-statement.org), and the review was prospectively registered in PROSPERO (https://www.crd.york.ac.uk/prospero/, study ID CRD42022350846).17

Literature Search Strategy
Three bibliographic databases, PubMed, Embase.com, and Clarivate Analytics/Web of Science Core, were searched for relevant literature from inception to 25 August 2022. Searches were devised in collaboration with a medical information specialist (KAZ). Search terms, including synonyms, closely related words, and keywords, were used as index terms or free-text words: 'esophagectomy' and 'prediction'. No methodological search filters and no date or language restrictions were applied that would limit results.
ASReview (version 1.0) was used to rank potentially relevant titles and abstracts. Screening in ASReview was carried out independently by two reviewers (MPvNA and GLV). All references marked as relevant were manually screened for eligibility by both reviewers. If necessary, the full-text article was checked for the eligibility criteria. Differences in judgement were resolved through a consensus procedure. If no consensus was reached, a third reviewer was consulted (PRT).
The full search strategy is detailed in Supplementary Material 1.

Eligibility Criteria
Studies were included in which prognostic models/scales/indexes were developed and/or validated with respect to the preoperative prediction of morbidity (Clavien-Dindo score of at least 3) and/or mortality within 90 days after esophagectomy for esophageal cancer (regardless of histology or surgery type).18 All types of prediction modeling studies were included. We excluded articles written before 2010 because, to assess models that can be used today, it is desirable that the study population match the current patient population as much as possible. Models consisting of one type of variable (such as blood markers, nutritional status, or cardiopulmonary exercise testing) were excluded. Articles that only examined association and/or correlation between the score of a model and morbidity/mortality were excluded.
To compare the accuracy of models, only models that examined accuracy and reported an outcome measure, such as area under the receiver operating characteristic curve (AUC) or observed/expected ratio (O/E ratio), were included. For more details about the inclusion and exclusion criteria, see Supplementary Material 2. Outcome definitions are described in Supplementary Material 3.

Assessment of Methodological Quality
Two reviewers (MPvNA and GLV) independently assessed methodological quality of full-text papers using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).19 This tool, especially designed for systematic reviews of prediction models, assesses the risk of bias in four domains (participants, predictors, outcome, and statistical analysis) and addresses the concerns of applicability in three domains (participants, predictors, and outcome). A domain was assessed as low risk when all signaling questions were answered yes or probably yes. A domain was assessed as high risk when at least one signaling question in that domain was answered no or probably no. Overall risk of bias was assessed as low when all domains were considered low risk. Overall risk of bias was assessed as high when at least one domain was considered high risk. For domain one, participants, applicability was scored as unclear if it was unclear how many patients received neoadjuvant chemoradiation or if less than half the patients received neoadjuvant therapy.
When multiple models were developed and/or validated in a single study, a separate PROBAST form was needed for each model or for both development and validation. However, if the results were identical, they are reported as one result in the Supplementary Material.
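The rating rules described above can be sketched in a few lines of Python. This is only an illustration of the stated decision logic, not the official PROBAST form: the function names and answer codes (Y, PY, N, PN, NI) are our own labels, and the "unclear" fallback for cases that are neither clearly low nor high is an assumption that the text above does not spell out.

```python
from typing import Dict, List

LOW, HIGH, UNCLEAR = "low", "high", "unclear"

def rate_domain(answers: List[str]) -> str:
    """Rate one domain from its signaling-question answers.

    Answer codes (hypothetical labels): Y = yes, PY = probably yes,
    N = no, PN = probably no, NI = no information.
    """
    if any(a in ("N", "PN") for a in answers):
        return HIGH      # at least one no / probably no
    if all(a in ("Y", "PY") for a in answers):
        return LOW       # all yes / probably yes
    return UNCLEAR       # assumption: anything else is unclear

def rate_overall(domain_ratings: Dict[str, str]) -> str:
    """Combine the domain ratings into an overall risk-of-bias rating."""
    ratings = list(domain_ratings.values())
    if HIGH in ratings:
        return HIGH      # at least one high-risk domain
    if all(r == LOW for r in ratings):
        return LOW       # every domain is low risk
    return UNCLEAR       # assumption: mixed low/unclear is unclear

# Hypothetical example answers for the four risk-of-bias domains
example = {
    "participants": rate_domain(["Y", "PY"]),
    "predictors": rate_domain(["Y", "Y", "PY"]),
    "outcome": rate_domain(["Y", "NI"]),
    "analysis": rate_domain(["Y", "PN", "Y"]),
}
print(example, rate_overall(example))
```

A single "no" or "probably no" in the analysis domain is enough to make the overall rating high, which is consistent with many models in this review being rated as high risk of bias on the analysis domain alone.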

Data Extraction
Data extraction of the identified studies was performed using the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist (MPvNA).20 Extracted data consisted of study characteristics (first author, country, study type, pretreatment, surgery type, and cohort years), study outcomes (outcome, number of events/sample size, outcome measures used regarding discrimination, and calibration), and the variables used in the different models.

Data Synthesis and Statistical Analysis
The results were tabulated and a narrative synthesis was performed.
For prognostic accuracy (i.e., discrimination), the AUC value was often used. An AUC value under 0.60 was rated as poor, a value between 0.60 and 0.75 as possibly helpful discrimination, and more than 0.75 as useful discrimination.21 For calibration, the Hosmer-Lemeshow test was often used. Small p-values mean that the model has poor calibration.
To assess potential usefulness and readiness for clinical practice, we integrated the results of the methodological quality assessment with the predictive performance of the models (lower limit of the confidence interval of the AUC), presence of external validation, sample size, and availability of the input variables. We rated the models ranked high in the tables as the better models.
Commonalities of the models (predictor variables) were presented in a figure and quality assessments were transformed into figures using RStudio, version 4.2.1.
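As an illustration of the discrimination ratings used here, the following minimal Python sketch computes an AUC as the probability that a randomly chosen patient with the event receives a higher predicted risk than a randomly chosen patient without it, and then applies the cut-offs of 0.60 and 0.75. The function names are hypothetical, and assigning an AUC of exactly 0.75 to the "possibly helpful" category is our assumption about the boundary.

```python
from typing import List

def auc(scores: List[float], outcomes: List[int]) -> float:
    """AUC via pairwise comparison (Mann-Whitney); ties count as 0.5."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def rate_discrimination(auc_value: float) -> str:
    """Apply the cut-offs used in this review (boundary handling assumed)."""
    if auc_value < 0.60:
        return "poor"
    if auc_value <= 0.75:
        return "possibly helpful"
    return "useful"

# Hypothetical predicted risks and observed outcomes (1 = event)
risks = [0.1, 0.4, 0.35, 0.8]
events = [0, 0, 1, 1]
value = auc(risks, events)
print(value, rate_discrimination(value))
```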

Systematic Search
Details regarding the literature search are shown in Fig. 1. Of the original 15,011 references identified by the search, 22 articles met the inclusion criteria and were included in this systematic review, using data from 108,208 patients.

Study Characteristics
Study characteristics are presented in Table 1. Seven studies were conducted in Asia, seven in Europe, six studies in the USA, and two studies were intercontinental studies. In general, data were retrospectively collected from an existing database. Of the 22 studies, 12 collected their data entirely from 2010 onward, and the remaining studies partly after 2010. Eleven studies included a population in which at least half of the patients were pretreated with chemotherapy, radiotherapy, chemoradiotherapy, or immunochemotherapy.
Most articles described the development of one or more new models or described the validation of one or more existing models. We assessed 39 models, including 8 models in the study by Ohkura et al.22 From this study, only the two most relevant models were used in our analysis (anastomotic leakage and pneumonia). Finally, 33 models were included in this systematic review, of which 18 models were newly developed.

Quality Assessment
The overall risk of bias (ROB) and overall concerns regarding applicability are presented in Table 2. For more details about ROB and concerns regarding applicability, see Supplementary Material 4.
Risk of bias:
• Participants: For the development and validation of the models, retrospective cohort and/or existing (national) databases were generally used. Most studies showed a low risk of bias at this point.
• Predictors: No study appeared to be at risk of bias in this domain.
• Outcome: A few studies had unclear risk of bias owing to unclear definitions of outcome.
• Analysis: Most studies had a high risk of bias. In development studies this was mainly owing to an insufficient number of events per candidate variable, and in validation studies mainly owing to an insufficient number of events (fewer than 100). Development studies on mortality were carried out in large populations, in contrast to some development studies on pulmonary complications. With the exception of the validation study by D'Journo et al., the studies validating only preexisting models were carried out in small populations.23 Other common causes of risk of bias in development studies were lack of relevant model performance measures or lack of correction for overfitting or optimism.
A model may be overfit when it makes good predictions on the study sample (owing to certain typical factors in the study population) but poor predictions outside of the study sample. This can be corrected through techniques such as bootstrapping or cross-validation.
Concerns regarding applicability: These concerns generally arose because no information was given on whether neoadjuvant chemoradiation was administered, or because relatively few patients in the population underwent neoadjuvant chemoradiation.
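To make the idea of correcting for optimism concrete, the sketch below shows one common bootstrap approach: the model is refitted in each bootstrap sample, and the average drop in AUC when that refitted model is applied back to the original sample is subtracted from the apparent AUC. Everything here is illustrative and assumed: the data are simulated, the "model" is a deliberately simple stand-in (a linear score whose weights are the differences in predictor means between patients with and without the event), and none of it reproduces the method of any included study.

```python
import random
from statistics import mean

def auc(scores, outcomes):
    """AUC via pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit(X, y):
    """Toy stand-in model (assumption): weight per predictor is the
    difference in means between events and non-events."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return [mean(r[j] for r in pos) - mean(r[j] for r in neg)
            for j in range(len(X[0]))]

def predict(weights, X):
    """Linear score for each patient."""
    return [sum(w * v for w, v in zip(weights, row)) for row in X]

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Return (apparent AUC, optimism-corrected AUC)."""
    rng = random.Random(seed)
    apparent = auc(predict(fit(X, y), X), y)
    n = len(y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue                                  # need both outcome classes
        w = fit(Xb, yb)
        boot_auc = auc(predict(w, Xb), yb)            # performance where fitted
        test_auc = auc(predict(w, X), y)              # performance on original data
        optimism.append(boot_auc - test_auc)
    return apparent, apparent - mean(optimism)

def simulate(n=60, n_noise=8, seed=42):
    """Simulated cohort: one weakly informative predictor plus pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        X.append([rng.gauss(0.8 * label, 1.0)] +
                 [rng.gauss(0.0, 1.0) for _ in range(n_noise)])
        y.append(label)
    return X, y

X, y = simulate()
apparent, corrected = optimism_corrected_auc(X, y, n_boot=100, seed=1)
print(f"apparent AUC {apparent:.2f}, optimism-corrected AUC {corrected:.2f}")
```

With a small sample and several noise predictors, the apparent AUC will typically be higher than the corrected one; that gap is the optimism that development studies are expected to report or correct for.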

Discrimination, Calibration, and Validation
Study outcomes are presented in Table 2. More detailed information can be found in Supplementary Material 5.

Discrimination
Discrimination of models for mortality: Only Takeuchi's models had a prognostic accuracy of about or greater than 0.75; Sasaki's models were just below that.24,25 The remaining models had accuracies between 0.60 and 0.75. 7][28][29] Other models had poor performance, including the aCCI validated in studies other than Filip's.27,30,31
Discrimination of models for pulmonary complications: The development of Wang's model, the validation of Thomas' model, and the Ferguson model found accuracies above 0.75. 3][34] The remaining models found accuracies between 0.60 and 0.75.
Discrimination of models for anastomotic leakage: All three newly developed models found accuracies between 0.53 and 0.63.

Calibration
Calibration was most often reported as a non-significant Hosmer-Lemeshow test 6 or a figure such as a scatterplot or calibration plot. 5 Eight studies did not report on calibration and one study indicated a favorable correlation between predicted and observed events, but the data were not shown.24 For more details about calibration, see Supplementary Material 5.

Validation
None of the 18 newly developed models were validated by another research group; 14 models were validated by the authors' own research group (in a new population or by bootstrapping), and 4 models were developed but not validated. Additionally, 11 existing models were validated one or more times. For more details, see Supplementary Material 6.

Predictor Variables
An overview of the different predictor variables is presented in Fig. 2A,B.
For the prediction models on mortality, 52 different variables were used with a median of 10 variables per model (range 4-20). Most studies used eight or more predictor variables. The easiest model to use is Steyerberg's model with just four easily available predictor variables.23 The predictor variables could be classified as patient characteristics, medical history, tumor and treatment, test results, or other. Age, sex, and performance status were the predictor variables most used regarding patient characteristics. ASA/comorbidity in general, (congestive) heart failure, and preoperative dialysis/renal dysfunction were the predictor variables most used regarding medical history. Histology, N-classification, and cancer metastasis/relapse were most used regarding tumor and treatment. PT-INR, white blood cell count, and sodium were most used regarding test results. Hospital volume was included in three models.
For the pulmonary complication prediction models, 43 different variables were used with a median of 5 predictors per model (range 2-28). Most models used fewer than five predictors. Exceptions were Van Kooten's model regarding pulmonary complications, which had 28 predictor variables, and Ohkura's model, which had 17 predictor variables.22,35 Again, variables could be classified as patient characteristics, medical history, tumor and treatment, or test results. Only age and histology were used in more than two models.
For most models, variables are easily available, making the models relatively easy to use (Table 2). Only the models of Ferguson (studied by Reinersman et al.) and Wang require a pulmonary function test, which is not performed routinely in all patients.32,34

Assessment of the Best Models
On the basis of these results, we could conclude that there are a number of models that have the potential to be developed further. For models that predict mortality, the most promising models are the models by D'Journo et al. and Takeuchi et al.24,36 On the basis of quality assessment, there is a risk of bias, but by weighing this against the other points of assessment (validation of a model in a sample separate from the development cohort, height of the lower limit of the 95% confidence interval of the AUC, generalizability of the study, and sample size), these seem to be the better models. On the basis of the results, for models predicting pulmonary complications, the model by Thomas et al. is the best performing model.33 For anastomotic leakage, a model with potential has yet to be developed (partly owing to the fact that all three models had an AUC lower than 0.64).

DISCUSSION
The major findings of this systematic review assessing prediction models for complications after esophagectomy are that there are several models that are either promising to be further developed or provide us with information about risk factors for the development of complications. Models with the most potential regarding prediction of mortality are the models by D'Journo and Takeuchi, while Thomas's model has the most potential regarding pulmonary complications. However, none of these three models have been validated by independent investigators yet.
Although it may be too early to implement complication prediction models in clinical practice, given the often relatively low AUC, the risk of bias, etc., the mortality models do at least provide us with relevant information about variables that influence the mortality risk. Common predictor variables in mortality models include age, sex, performance status, ASA score/comorbidity in general, and cancer metastasis/relapse. Of these factors, possibly only performance status could be influenced prior to surgery. This could mean two things. Either a poor performance status could be examined preoperatively to see whether it can be improved before the esophagectomy, or performance status, if not already considered, could be given a role in the preoperative assessment of whether surgery is the best option for the patient, as is the case for the non-modifiable factors age, sex, comorbidity, and cancer metastasis/relapse. In conversation with the patient, however, we should remain cautious regarding statements about the severity of the risk of complications. While we do know that a number of single factors can affect risk, we do not know what the exact level of risk is when multiple risk factors are present.
It is notable that in models for pulmonary complications, there is no uniformity in which variables should be included in models. More research is needed for this.
When examining model variables, it is noteworthy that not all models incorporate factors known to be linked with mortality and morbidity, such as surgical technique and hospital volume.2,37-39 A possible explanation is that surgical technique is not regarded as a preoperative variable, and hospital volume is not considered a patient-specific variable. Additionally, if the entire population comprises patients treated with the same technique, it is logical that the technique may not be included as a variable. Therefore, we recommend considering surgical technique as a potential variable in models for populations with diverse techniques when developing or revising a model. A recently published systematic review also focused on preoperative prediction models for complications after esophagectomy.40 While that study included all studies from 2000, our study focused on populations as close as possible to current patient populations in terms of neoadjuvant treatment, etc. We used a more robust quality assessment tool and provided a more detailed description of quality assessment.19,41 As a result, we rated more studies as high risk of bias.
Generalizability issues are a major risk in all assessed prediction studies, a problem that is inherent to the topic. We included a large number of prediction models and data of more than 100,000 patients. However, the low event rate of post-esophagectomy mortality (usually below 5%, and in large centers below 1%) substantially decreases the effective sample size available for risk factor identification and prediction modelling in each individual study. None of the validation studies reported a sample size calculation or referred to the simple rule of thumb of at least 100 events in the study population.42,43 Only the validation studies performed by D'Journo et al. and Wan et al. met this rule of thumb.23,44 Additionally, very lengthy study periods to obtain a workable sample size (up to 15 years in some studies) can mean that some data are outdated by the time of publication, as diagnostics, operative techniques, and postoperative treatment protocols have changed.
None of the 18 developed models have been validated outside their own research group. This is a more widely known problem in prediction modeling, as only 15% of developed prediction models are externally validated.45 AUCs in external validation studies are generally lower than in the development study and never increase by more than 0.03. This means that the real-world accuracy could not be assessed for any of the models.
Eight studies did not report on model calibration and one study indicated a favorable correlation between predicted and observed events.24 Calibration indicates the extent to which the predicted proportions of the event match the actually observed proportions of the event and is particularly important when a model will be used to support a decision.12
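The consequence of low event rates for the rule of thumb of at least 100 events can be worked out with simple arithmetic. The short Python sketch below is only an illustration; the function name is hypothetical, and the 5% and 1% event rates are the upper bounds mentioned in the text rather than estimates from any specific study.

```python
import math
from fractions import Fraction

def required_sample_size(min_events: int, event_rate: float) -> int:
    """Smallest cohort expected to contain at least `min_events` events."""
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must be in (0, 1]")
    # Fraction avoids floating-point rounding in the division
    rate = Fraction(event_rate).limit_denominator(10**6)
    return math.ceil(min_events / rate)

# At 5% mortality, 100 expected events need a cohort of 2000 patients;
# at 1% mortality, the same rule needs 10,000 patients.
print(required_sample_size(100, 0.05))
print(required_sample_size(100, 0.01))
```

This illustrates why single-center validation studies of mortality models rarely reach 100 events without very long inclusion periods.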
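A minimal Python sketch of what checking calibration involves is given below: the observed/expected (O/E) ratio compares the total number of observed events with the sum of predicted risks, and a grouped table compares mean predicted risk with the observed event proportion in risk-ordered groups. The function names and the example data are hypothetical, and this is not the Hosmer-Lemeshow test itself, only the comparison of predicted and observed proportions that underlies it.

```python
from typing import List, Tuple

def observed_expected_ratio(predicted: List[float], outcomes: List[int]) -> float:
    """O/E ratio: observed events divided by the sum of predicted risks."""
    expected = sum(predicted)
    if expected == 0:
        raise ValueError("sum of predicted risks is zero")
    return sum(outcomes) / expected

def calibration_table(predicted: List[float], outcomes: List[int],
                      n_groups: int = 10) -> List[Tuple[float, float]]:
    """(mean predicted risk, observed proportion) per risk-ordered group."""
    pairs = sorted(zip(predicted, outcomes))
    n = len(pairs)
    table = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # fewer patients than groups
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk) / len(chunk)
        table.append((mean_pred, observed))
    return table

# Hypothetical predicted risks and observed outcomes (1 = event)
risks = [0.1, 0.1, 0.5, 0.5]
events = [0, 0, 1, 0]
print(observed_expected_ratio(risks, events))
print(calibration_table(risks, events, n_groups=2))
```

An O/E ratio near 1 and group-wise observed proportions close to the mean predicted risks indicate good calibration; large departures in either direction mean that predicted risks should not be communicated to patients at face value.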

Strengths and Limitations
This study has several strengths. To our knowledge, this systematic review is the most recent and thorough systematic review on preoperative morbidity and mortality prediction models to date. Moreover, we registered our study at PROSPERO in advance. ][48][49][50] One of the limitations of this study is that, in several included studies, the pretreatment was not clearly stated, the pretreatment was radiation, chemotherapy, or immunochemotherapy, or chemoradiation was given in only a small proportion of patients.
In conclusion, the availability of rigorous prediction models is limited and none are ready for clinical implementation. Several models are promising but need to be further developed. In addition, some models provide us with information regarding risk factors for the development of complications. Performance status is a potentially modifiable risk factor when it comes to reducing risk of morbidity and mortality.

FIG. 1 Flowchart of the study search and selection procedure

FIG. 2 A Mortality predictor variables; B predictor variables of pulmonary complications

TABLE 1
Study characteristics

Table 1 (continued)
Dev development, Val validation, Int internal, Ext external, NA not applicable, CTx chemotherapy, CRTx chemoradiotherapy, ICTx immunochemotherapy, MIE minimally invasive esophagectomy, OE open esophagectomy, TT transthoracic, TH transhiatal, TA thoracoabdominal, RAI risk analysis index, Rev revised, RAI-A administrative risk analysis index, mFI-5 5-factor modified frailty index, STS GTSD Society of Thoracic Surgeons General Thoracic Surgery Database, aCCI age-adjusted Charlson comorbidity index, CCI Charlson Comorbidity Index, ASA American Society of Anesthesiologists, O-POSSUM physiological and operative severity score for the enumeration of mortality and morbidity adjusted for oesophagogastric surgery, ACS NSQIP American College of Surgeons National Surgical Quality Improvement Program, ? unknown

TABLE 2
Summary of evaluation of prediction models for mortality and pulmonary complications

Table 2 (continued)
*The studies by Takeuchi et al. and Raymond et al. described that calibration had been done, but no data were shown. NA not available
[Fig. 2 axis labels (predictor variables): diabetes mellitus, esophageal varices, liver disease, preoperative dialysis/renal dysfunction, cancer, transfer emergency room, corticosteroid use, histology, T-classification, N-classification, neoadjuvant treatment, surgery type, cancer metastasis/relapse, ECG, systolic blood pressure, pulse rate, GCS, haemoglobin, platelet, APTT, PT-INR, WBC, sodium, potassium, albumin, BUN/urea, AST, ALP, CRP, hospital volume]